Abstract

We observe that the technique of Markov contraction can be used to establish measure
concentration for a broad class of non-contracting chains. In particular, geometric
ergodicity provides a simple and versatile framework. This leads to a short, elementary
proof of a general concentration inequality for Markov and hidden Markov chains (HMM),
which supersedes some of the known results and easily extends to other processes such as
Markov trees. As applications, we give a Dvoretzky-Kiefer-Wolfowitz-type inequality and
a uniform Chernoff bound. All of our bounds are dimension-free and hold for countably
infinite state spaces.



1 Introduction 

1.1 Background

The last decade or so has seen a flurry of activity in concentration of measure for
non-independent processes. A recent survey may be found in [18], with pointers to more
specialized surveys therein. Rather than recapitulating these surveys here, we shall proceed
directly to the relevant recent developments. Let $X_1, X_2, \ldots$ be a sequence of
$\mathbb{N}$-valued random variables obeying some joint law (distribution). Using the
shorthand $\mathcal{L}(X_j^n \mid X_1^i = x)$ to denote the law of $(X_j, \ldots, X_n)$
conditioned on $(X_1, \ldots, X_i) = x \in \mathbb{N}^i$, let us define, for
$n \in \mathbb{N}$, $1 \le i < j \le n$, $y \in \mathbb{N}^{i-1}$ and $w, w' \in \mathbb{N}$,

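$$\eta_{ij}(y, w, w') = \left\| \mathcal{L}(X_j^n \mid X_1^i = yw) - \mathcal{L}(X_j^n \mid X_1^i = yw') \right\|_{TV}$$

(where $\|\cdot\|_{TV} = \frac{1}{2}\|\cdot\|_1$ is the total variation norm) and

$$\bar\eta_{ij} = \sup_{y \in \mathbb{N}^{i-1},\ w, w' \in \mathbb{N}} \eta_{ij}(y, w, w'). \qquad (1)$$

The coefficients $\bar\eta_{ij}$, termed $\eta$-mixing coefficients in [19], play a central role
in several recent concentration results. Define $\Delta$ to be the upper-triangular
$n \times n$ matrix, with $\Delta_{ii} = 1$ and $\Delta_{ij} = \bar\eta_{ij}$ for
$1 \le i < j \le n$.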
In 2007, [6] and [19] independently proved that for any $f : \mathbb{N}^n \to \mathbb{R}$ with
$\|f\|_{Lip} \le 1$ with respect to the Hamming metric$^1$, we have

$$\mathbf{P}(|f - \mathbf{E}f| > n\varepsilon) \le 2\exp\left(-\frac{2n\varepsilon^2}{\min\{\|\Delta\|_2^2,\ \|\Delta\|_\infty^2\}}\right), \qquad (2)$$

where $\|\Delta\|_p$ is the $\ell_p$ operator norm ([6] achieve the better constant in the
exponent, given here). Earlier, Samson [30] had given a concentration result for convex
$\ell_2$-Lipschitz functions $f : [0, 1]^n \to \mathbb{R}$, which likewise involved the
coefficients $\bar\eta_{ij}$, and these are also implicit in Marton's earlier work [25, 26, 27].

In order to apply (2) in a Markov setting, one must upper-bound $\|\Delta\|_2$ or
$\|\Delta\|_\infty$ for the Markov chain in question. The earliest such results relied on
contraction. Let $p(\cdot \mid \cdot)$ be the transition kernel associated with a given Markov
chain, and define the (Doeblin) contraction coefficient

$$\kappa = \sup_{x, x' \in \mathbb{N}} \left\| p(\cdot \mid x) - p(\cdot \mid x') \right\|_{TV}. \qquad (3)$$

It is shown in [19] and [30] that $\bar\eta_{ij} \le \kappa^{j-i}$ and therefore
$\|\Delta\|_\infty \le (1 - \kappa)^{-1}$; this implies the concentration bound

$$\mathbf{P}(|f - \mathbf{E}f| > n\varepsilon) \le 2\exp\left(-2(1 - \kappa)^2 n\varepsilon^2\right)$$

for 1-Lipschitz functions $f$, which Marton [24] had (essentially) obtained earlier by other
means.
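As a rough illustration of the contraction coefficient in (3) and the resulting bound, the following sketch computes $\kappa$ for a small finite kernel and evaluates $2\exp(-2(1-\kappa)^2 n\varepsilon^2)$. The 3-state kernel and the helper names (`tv`, `contraction_coefficient`, `contraction_bound`) are hypothetical choices made for this example only; the paper itself works with countable state spaces.

```python
from math import exp

def tv(p, q):
    """Total variation distance: half the l1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def contraction_coefficient(kernel):
    """kappa = max over pairs x, x' of ||p(.|x) - p(.|x')||_TV.

    kernel[x] is the next-state distribution p(. | x).
    """
    return max(tv(px, py) for px in kernel for py in kernel)

def contraction_bound(kappa, n, eps):
    """Right-hand side 2 exp(-2 (1 - kappa)^2 n eps^2)."""
    return 2.0 * exp(-2.0 * (1.0 - kappa) ** 2 * n * eps ** 2)

# Hypothetical 3-state kernel (illustration only); entry [x][y] is p(y | x).
kernel = [
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
    [0.3, 0.2, 0.5],
]
kappa = contraction_coefficient(kernel)
bound = contraction_bound(kappa, n=1000, eps=0.1)
print(kappa, bound)
```

For a finite kernel the supremum is a maximum over pairs of states, so the cost is quadratic in the number of states.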

The contraction method was pushed further to obtain concentration results for hidden
Markov chains [19], undirected Markov chains and Markov tree processes [18], but its
applicability requires the rather stringent condition that $\kappa < 1$. Already in [25],
Marton observed that a significantly weaker mixing condition suffices, and yields tighter and
more informative bounds. Indeed, consider a Markov chain with stationary distribution $\pi$
and conditional $s$th-step distribution $\mathcal{L}(X_s \mid X_1 = x)$, and define the
"inverse mixing time"

$$\tau_s = \sup_{x \in \mathbb{N}} \left\| \mathcal{L}(X_s \mid X_1 = x) - \pi \right\|_{TV}. \qquad (4)$$

A simple calculation (Lemma 7) shows that $\bar\eta_{ij} \le 2\tau_{j-i+1}$, and thus

$$\|\Delta\|_\infty - 1 = \max_{1 \le i < n} \sum_{j=i+1}^{n} \bar\eta_{ij} \le 2 \max_{1 \le i < n} \sum_{j=i+1}^{n} \tau_{j-i+1}.$$



A rich body of work deals with bounding $\tau_s$ via spectral [14], Poincaré [10],
log-Sobolev [9] and Lyapunov [20] methods, among others (see the references in the works
cited). From our perspective, the geometric ergodicity condition allows for the simplest
exposition while sacrificing the least generality. A Markov chain is said to be geometrically
ergodic with constants $1 \le G < \infty$ and $0 \le \theta < 1$ if

$$\tau_s \le G\theta^{s-1}, \qquad s = 1, 2, \ldots. \qquad (5)$$

$^1$ meaning: if $x, y \in \mathbb{N}^n$ differ in only 1 coordinate then $|f(x) - f(y)| \le 1$
We remark that in the finite state case, any ergodic Markov chain is geometrically ergodic,
and the dependence of $G, \theta$ on various structural properties of the chain in question is
the subject of a diverse and prolific literature (including the references above). We also
stress that the geometric ergodicity assumption is largely dictated by expositional
convenience, since any non-trivial bound on the inverse mixing time $\tau_s$ will yield
straightforward analogues of our results.
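To make the inverse mixing time and the geometric ergodicity condition concrete, here is a minimal sketch that computes $\tau_s$ by brute force for a hypothetical two-state chain and compares it with $G\theta^{s-1}$. The chain (flip probabilities `a`, `b`) and the helper names are assumptions for illustration. For a two-state chain the deviation from stationarity shrinks by the factor $|1 - a - b|$ at every step, so $G = 1$ and $\theta = |1 - a - b|$ are valid constants here.

```python
def tv(p, q):
    """Total variation distance: half the l1 distance."""
    return 0.5 * sum(abs(u - v) for u, v in zip(p, q))

def step(dist, kernel):
    """One transition: new[y] = sum_x kernel[x][y] * dist[x]."""
    m = len(kernel)
    return [sum(kernel[x][y] * dist[x] for x in range(m)) for y in range(m)]

def inverse_mixing_time(kernel, pi, s):
    """tau_s = max_x || L(X_s | X_1 = x) - pi ||_TV for a finite chain."""
    m = len(kernel)
    worst = 0.0
    for x in range(m):
        dist = [1.0 if y == x else 0.0 for y in range(m)]  # point mass at x
        for _ in range(s - 1):                             # s - 1 transitions
            dist = step(dist, kernel)
        worst = max(worst, tv(dist, pi))
    return worst

# Hypothetical two-state chain: 0 -> 1 with prob. a, 1 -> 0 with prob. b.
a, b = 0.3, 0.1
kernel = [[1 - a, a], [b, 1 - b]]        # kernel[x][y] = p(y | x)
pi = [b / (a + b), a / (a + b)]          # stationary distribution
G, theta = 1.0, abs(1 - a - b)
taus = [inverse_mixing_time(kernel, pi, s) for s in range(1, 11)]
for s, t in enumerate(taus, start=1):
    print(s, t, G * theta ** (s - 1))
```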

In this paper, we explore some consequences of geometric ergodicity as pertaining to 
concentration and statistical inference for Markov and hidden Markov chains. We leverage 
two basic insights: (i) even though hidden Markov chains are a considerably richer class of 
processes than Markov chains (there exist HMMs not realizable by any finite-order Markov 
chain), for the purposes of measure concentration, the underlying Markov chain is all that 
matters and (ii) geometric ergodicity, while significantly more general than contractivity, 
yields essentially the same concentration bounds. Another advantage of our approach is its 
elementary nature: taking the bound in (2) as a given, nothing beyond basic linear algebra is 
used. 

Given the recent interest in prediction and parameter inference for HMMs [3, 16, 29, 31],
our results have the potential to be applicable beyond the abstract setting studied here.
Furthermore, since concentration results for Markov chains extend easily to other Markov-type
processes (such as trees [18]), our results here should extend to those as well.

1.2 Main results 

Concentration. Our first result is a concentration inequality for hidden Markov chains,
which generalizes many of the previous such bounds. We will henceforth write
"$(G, \theta)$-geometrically ergodic" as shorthand for "geometrically ergodic with constants
$1 \le G < \infty$ and $0 \le \theta < 1$". Hidden Markov chains and their associated notions
of stationarity and geometric ergodicity are formally defined in Section 2.1.

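Theorem 1. Let $Y_1, Y_2, \ldots$ be an $\mathbb{N}$-valued hidden Markov chain whose
underlying $\mathbb{N}$-valued Markov chain is $(G, \theta)$-geometrically ergodic. Then, for
any $n \in \mathbb{N}$ and $f : \mathbb{N}^n \to \mathbb{R}$ with $\|f\|_{Lip} \le 1$ (under
the Hamming metric), we have

$$\mathbf{P}(f - \mathbf{E}f > n\varepsilon) \le \exp\left(-\frac{n(1-\theta)^2\varepsilon^2}{2G^2}\right), \qquad \varepsilon > 0,$$

with an identical bound for the other tail.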

Although the result in Theorem 1 does not appear to have been published anywhere, it is 
a simple consequence of widely known facts (we give a proof in Section 2 for completeness). 
Our main contribution lies in the apparently novel applications. 
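As an illustration of how the tail bound of Theorem 1 might be used in practice, the sketch below evaluates $\exp(-n(1-\theta)^2\varepsilon^2/(2G^2))$, the form in which the bound is applied in Sections 2.4 and 2.5, and inverts it to get a sample size. The function names and the numerical values of $G$, $\theta$, $\varepsilon$, $\delta$ are assumptions for this example.

```python
from math import ceil, exp, log

def theorem1_tail(n, eps, G, theta):
    """One-sided tail bound exp(-n (1 - theta)^2 eps^2 / (2 G^2))."""
    return exp(-n * (1.0 - theta) ** 2 * eps ** 2 / (2.0 * G ** 2))

def samples_needed(eps, delta, G, theta):
    """Smallest n for which the tail bound is at most delta.

    Solves exp(-n (1 - theta)^2 eps^2 / (2 G^2)) <= delta for n.
    """
    return ceil(2.0 * G ** 2 * log(1.0 / delta) / ((1.0 - theta) ** 2 * eps ** 2))

# Hypothetical constants, chosen only for illustration.
G, theta, eps, delta = 1.0, 0.6, 0.1, 0.05
n_req = samples_needed(eps, delta, G, theta)
print(n_req, theorem1_tail(n_req, eps, G, theta))
```

The required sample size scales like $G^2/(1-\theta)^2$, so slowly mixing chains ($\theta$ close to 1) need proportionally more samples than the iid case.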

DKW-type inequality. Let us recall the Dvoretzky-Kiefer-Wolfowitz inequality [13, 28],
stated here for the discrete case. Suppose $X_1, X_2, \ldots$ are iid $\mathbb{N}$-valued
random variables with common distribution function $F$, and define the empirical distribution
function $F_n$ induced by $(X_1, \ldots, X_n)$:

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{X_i \le x\}}, \qquad x \in \mathbb{N}.$$

The DKW inequality states that

$$\mathbf{P}\left(\sup_{x \in \mathbb{N}} |F_n(x) - F(x)| > \varepsilon\right) \le 2\exp(-2n\varepsilon^2), \qquad \varepsilon > 0,\ n \in \mathbb{N}.$$

We present the following generalization of this inequality. 

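Theorem 2. Let $Y_1, Y_2, \ldots$ be a stationary $\mathbb{N}$-valued
$(G, \theta)$-geometrically ergodic Markov or hidden Markov chain with stationary distribution
$\rho \in \mathbb{R}^{\mathbb{N}}$. For $n \in \mathbb{N}$, define
$\hat\rho^{(n)} \in \mathbb{R}^{\mathbb{N}}$ to be the empirical estimate of $\rho$:

$$\hat\rho^{(n)}_y = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{Y_i = y\}}, \qquad y \in \mathbb{N}. \qquad (6)$$

Then

$$\mathbf{P}\left(\left\|\rho - \hat\rho^{(n)}\right\|_\infty > \sqrt{\frac{1 + 4G\theta}{n(1-\theta)}} + \varepsilon\right) \le \exp\left(-\frac{n(1-\theta)^2\varepsilon^2}{2G^2}\right), \qquad n \in \mathbb{N},\ \varepsilon > 0.$$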



Note that a naive application of Theorem 1 to each $\hat\rho^{(n)}_y$ individually, combined
with the union bound, yields

$$\mathbf{P}\left(\left\|\rho - \hat\rho^{(n)}\right\|_\infty > \varepsilon\right) \le 2\|\rho\|_0 \exp\left(-\frac{n(1-\theta)^2\varepsilon^2}{2G^2}\right), \qquad (7)$$

where $\|\rho\|_0$ is the number of non-zero entries in $\rho$. Obviously, for stationary
distributions with infinite support, the bound in (7) is vacuous. The assumption that the
chain starts in the stationary distribution is not at all restrictive, as shown in Section 2.6.
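The following sketch illustrates Theorem 2 on simulated data: it draws a stationary two-state Markov chain (a special case of an HMM with identity emissions), forms the empirical estimate $\hat\rho^{(n)}$, and compares the sup-norm error with the threshold $\sqrt{(1+4G\theta)/(n(1-\theta))} + \varepsilon$, where $\varepsilon$ is chosen so that the exponential term equals a target $\delta$. The chain, the seed, and the helper names are hypothetical. The constants $G = 1$, $\theta = |1 - a - b|$ are valid for a two-state chain, as assumed here.

```python
import random
from collections import Counter
from math import log, sqrt

def empirical_distribution(ys):
    """rho_hat[y] = fraction of indices i with Y_i = y."""
    n = len(ys)
    return {y: c / n for y, c in Counter(ys).items()}

def theorem2_threshold(n, G, theta, delta):
    """sqrt((1 + 4 G theta) / (n (1 - theta))) + eps, with eps solving
    exp(-n (1 - theta)^2 eps^2 / (2 G^2)) = delta."""
    mean_term = sqrt((1.0 + 4.0 * G * theta) / (n * (1.0 - theta)))
    eps = (G / (1.0 - theta)) * sqrt(2.0 * log(1.0 / delta) / n)
    return mean_term + eps

def sample_chain(n, a, b, rng):
    """Stationary two-state chain: 0 -> 1 w.p. a, 1 -> 0 w.p. b."""
    state = 0 if rng.random() < b / (a + b) else 1   # start from stationarity
    path = [state]
    for _ in range(n - 1):
        u = rng.random()
        if state == 0:
            state = 1 if u < a else 0
        else:
            state = 0 if u < b else 1
        path.append(state)
    return path

a, b, n = 0.3, 0.1, 5000
rho = {0: b / (a + b), 1: a / (a + b)}
rho_hat = empirical_distribution(sample_chain(n, a, b, random.Random(0)))
sup_dev = max(abs(rho[y] - rho_hat.get(y, 0.0)) for y in rho)
threshold = theorem2_threshold(n, G=1.0, theta=abs(1 - a - b), delta=0.05)
print(sup_dev, threshold)
```

Note that the threshold does not depend on the size of the support, in contrast with the union-bound estimate.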

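Uniform Chernoff bound. Let $Y_1, Y_2, \ldots$ be a stationary $\mathbb{N}$-valued
$(G, \theta)$-geometrically ergodic Markov or hidden Markov chain as above, and consider the
occupation frequency:

$$\hat\rho^{(n)}(E) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{Y_i \in E\}}, \qquad E \subseteq \mathbb{N}.$$

A naive application of Theorem 1 might yield a deviation bound along the lines of

$$\mathbf{P}\left(\left|\rho(E) - \hat\rho^{(n)}(E)\right| > \varepsilon\right) \le 2|E| \exp\left(-\frac{n(1-\theta)^2\varepsilon^2}{2G^2}\right),$$

where $|E|$ is the cardinality of $E$ and $\rho$ is the stationary distribution as above. We
will give a much stronger bound, that is not only independent of $E$ but is actually uniform
over all $E \subseteq \mathbb{N}$.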

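Theorem 3. Define

$$\Lambda_n(\rho) = \gamma_n(G, \theta) \sum_{\rho_y \ge 1/n} \sqrt{\rho_y} + \min\left\{ \gamma_n(G, \theta) \sum_{\rho_y < 1/n} \sqrt{\rho_y},\ \sum_{\rho_y < 1/n} \rho_y \right\}, \qquad n \in \mathbb{N},$$

where

$$\gamma_n(G, \theta) = \frac{1}{2} \sqrt{\frac{1 + 4G\theta}{n(1-\theta)}}.$$

Then:

(a) for all distributions $\rho \in \mathbb{R}^{\mathbb{N}}$,

$$\lim_{n \to \infty} \Lambda_n(\rho) = 0,$$

(b)

$$\mathbf{P}\left(\sup_{E \subseteq \mathbb{N}} \left|\rho(E) - \hat\rho^{(n)}(E)\right| > \Lambda_n(\rho) + \varepsilon\right) \le \exp\left(-\frac{n(1-\theta)^2\varepsilon^2}{2G^2}\right).$$

We remark that the rate at which $\Lambda_n(\rho)$ decays to 0 depends on $\rho$ and may be
arbitrarily slow for heavy-tailed distributions. When
$\sum_{y \in \mathbb{N}} \sqrt{\rho_y} < \infty$, we get a simpler estimate in (b) via

$$\Lambda_n(\rho) \le \gamma_n(G, \theta) \sum_{y \in \mathbb{N}} \sqrt{\rho_y}.$$

Again, the stationarity assumption is quite mild (Section 2.6).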
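To get a feel for the bias term in Theorem 3 (written $\Lambda_n(\rho)$ here), the sketch below evaluates it for a finitely supported distribution and shows its decay in $n$. It assumes the reading suggested by the proof of Lemma 9, namely $\Lambda_n(\rho) = \gamma_n \sum_{\rho_y \ge 1/n} \sqrt{\rho_y} + \min\{\gamma_n \sum_{\rho_y < 1/n} \sqrt{\rho_y}, \sum_{\rho_y < 1/n} \rho_y\}$ with $\gamma_n = \frac{1}{2}\sqrt{(1+4G\theta)/(n(1-\theta))}$. That reading, the truncated geometric example, and the function names are assumptions of this illustration.

```python
from math import sqrt

def gamma_n(n, G, theta):
    """Assumed form: (1/2) sqrt((1 + 4 G theta) / (n (1 - theta)))."""
    return 0.5 * sqrt((1.0 + 4.0 * G * theta) / (n * (1.0 - theta)))

def lambda_n(rho, n, G, theta):
    """Bias term of Theorem 3 under the assumed definition."""
    g = gamma_n(n, G, theta)
    heavy = [p for p in rho if p >= 1.0 / n]        # atoms with rho_y >= 1/n
    light = [p for p in rho if 0.0 < p < 1.0 / n]   # atoms with rho_y < 1/n
    light_term = min(g * sum(sqrt(p) for p in light), sum(light))
    return g * sum(sqrt(p) for p in heavy) + light_term

# Hypothetical example: geometric weights 2^{-y}, truncated to 40 atoms.
weights = [2.0 ** -y for y in range(1, 41)]
total = sum(weights)
rho = [w / total for w in weights]
ns = (10, 100, 1000, 10000)
values = [lambda_n(rho, n, G=1.0, theta=0.6) for n in ns]
simple = [gamma_n(n, 1.0, 0.6) * sum(sqrt(p) for p in rho) for n in ns]
for n, v, s in zip(ns, values, simple):
    print(n, v, s)
```

The last column is the simpler estimate $\gamma_n \sum_y \sqrt{\rho_y}$, which always dominates the first because the minimum is at most its first argument.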



1.3 Related work 

In parallel to the work on concentration of measure results for Markov chains [1, 2, 7, 19, 24, 
30], grew a body of independent results on Chernoff-type bounds for these processes. The 
papers [11, 12, 15, 17, 22] played a founding role, and various extensions and refinements 
followed [21, 32]. In a remarkable recent development [8], optimal Chernoff-Hoeffding bounds 
are obtained based on the mixing time at a constant threshold. 



2 Methods and proofs 
2.1 Preliminaries 

For readability, we will sometimes write the matrix entry $A_{x,y}$ as $A(x \mid y)$. We will
use the terms hidden Markov chain and HMM interchangeably.

Markov chains. We will represent Markov kernels by column-stochastic
$\mathbb{N} \times \mathbb{N}$ matrices denoted by the letter $A$. Thus, a Markov chain with
transition kernel $A$ and initial distribution $p_1$ induces the following distribution on
$\mathbb{N}^n$:

$$\mathcal{L}(X_1, \ldots, X_n) = p_1(X_1) \prod_{i=1}^{n-1} A(X_{i+1} \mid X_i). \qquad (8)$$
Hidden Markov chain. A hidden Markov chain (also known as hidden Markov model
[HMM]) is specified by the triple $(p_1, A, B)$, where $(p_1, A)$ are the Markov chain
parameters as above and $B$ is an $\mathbb{N} \times \mathbb{N}$ column-stochastic matrix of
emission probabilities. This HMM induces a distribution on $\mathbb{N}^n$ as follows. Let
$X \in \mathbb{N}^n$ be distributed according to (8) and define the conditional distribution
$\mathcal{L}(\cdot \mid X)$ over $Y \in \mathbb{N}^n$:

$$\mathcal{L}(Y \mid X) = \prod_{i=1}^{n} B(Y_i \mid X_i).$$

It follows that

$$\mathcal{L}(Y) = \sum_{x \in \mathbb{N}^n} \mathbf{P}(X = x)\, \mathcal{L}(Y \mid X = x).$$

We will refer to $Y$ as a hidden Markov chain and to $X$ as its underlying Markov chain.
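The definition of the HMM law as a mixture over hidden paths can be sketched by brute-force enumeration, which is feasible only for very short sequences and finite alphabets. The two-state parameters and the helper names below are hypothetical. The matrices follow the column-stochastic convention, so `A[nxt][prev]` plays the role of $A(x_{i+1} \mid x_i)$ and `B[y][x]` that of $B(y \mid x)$.

```python
from itertools import product

def chain_prob(x, p1, A):
    """P(X = x) = p1[x_1] * prod_i A[x_{i+1}][x_i] (column-stochastic A)."""
    prob = p1[x[0]]
    for prev, nxt in zip(x, x[1:]):
        prob *= A[nxt][prev]
    return prob

def emission_prob(y, x, B):
    """L(Y = y | X = x) = prod_i B[y_i][x_i] (column-stochastic B)."""
    prob = 1.0
    for yi, xi in zip(y, x):
        prob *= B[yi][xi]
    return prob

def hmm_law(n, p1, A, B):
    """L(Y) as a dict: sum over hidden paths x of P(X = x) L(Y | X = x)."""
    states, symbols = range(len(p1)), range(len(B))
    return {
        y: sum(chain_prob(x, p1, A) * emission_prob(y, x, B)
               for x in product(states, repeat=n))
        for y in product(symbols, repeat=n)
    }

# Hypothetical parameters; each column of A and of B sums to 1.
p1 = [0.25, 0.75]
A = [[0.7, 0.1],
     [0.3, 0.9]]
B = [[0.9, 0.2],
     [0.1, 0.8]]
law = hmm_law(3, p1, A, B)
print(sum(law.values()))
```

The enumeration costs on the order of (number of states)$^n$ times (number of symbols)$^n$ operations; it is meant only to mirror the displayed formula, not to replace the forward algorithm.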



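Stationary distributions and chains. The stationary distribution
$\pi \in \mathbb{R}^{\mathbb{N}}$ of the Markov chain with transition kernel $A$ is the unique
stochastic vector satisfying $A\pi = \pi$. The Markov chain induced by $(p_1, A)$ is said to be
stationary if $p_1 = \pi$. It is well-known that, for ergodic Markov chains,

$$\pi = \lim_{n \to \infty} \mathcal{L}(X_n) = \lim_{n \to \infty} \mathbf{E}\hat\pi^{(n)},$$

where

$$\hat\pi^{(n)}_x = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{X_i = x\}}, \qquad x \in \mathbb{N}.$$

In the geometrically ergodic case, observing that
$\mathbf{E}\hat\pi^{(n)} = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(X_i)$, we have

$$\begin{aligned}
\left\| \mathbf{E}\hat\pi^{(n)} - \pi \right\|_{TV}
&= \left\| \frac{1}{n} \sum_{i=1}^{n} \left( \mathcal{L}(X_i) - \pi \right) \right\|_{TV}
 \le \frac{1}{n} \sum_{i=1}^{n} \left\| \mathcal{L}(X_i) - \pi \right\|_{TV} \\
&= \frac{1}{n} \sum_{i=1}^{n} \left\| \sum_{x \in \mathbb{N}} \mathcal{L}(X_i \mid X_1 = x)\, p_1(x) - \pi \right\|_{TV} \\
&\le \frac{1}{n} \sum_{i=1}^{n} \sum_{x \in \mathbb{N}} p_1(x) \left\| \mathcal{L}(X_i \mid X_1 = x) - \pi \right\|_{TV} \\
&\le \frac{1}{n} \sum_{i=1}^{n} \sum_{x \in \mathbb{N}} p_1(x)\, G\theta^{i-1}
 \le \frac{G}{(1-\theta)n}.
\end{aligned}$$

For a hidden Markov chain, we define the stationary distribution $\rho = B\pi$, and observe
that

$$\rho = \lim_{n \to \infty} \mathcal{L}(Y_n) = \lim_{n \to \infty} \mathbf{E}\hat\rho^{(n)},$$

where $\hat\rho^{(n)}$ is defined in (6). Since
$\mathbf{E}\hat\rho^{(n)} = B\,\mathbf{E}\hat\pi^{(n)}$ and $B$ is column-stochastic, we have

$$\left\| \mathbf{E}\hat\rho^{(n)} - \rho \right\|_{TV} \le \left\| \mathbf{E}\hat\pi^{(n)} - \pi \right\|_{TV} \le \frac{G}{(1-\theta)n}. \qquad (9)$$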
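The bias bound $G/((1-\theta)n)$ for the expected empirical distribution can be checked exactly on a small example, since $\mathbf{E}\hat\pi^{(n)}$ is the average of the first $n$ marginals $A^{i-1} p_1$. The sketch below does this for a hypothetical two-state kernel started from a point mass; the kernel, the constants $G = 1$, $\theta = 0.6$ (valid for this particular chain), and the helper names are assumptions.

```python
def tv(p, q):
    """Total variation distance: half the l1 distance."""
    return 0.5 * sum(abs(u - v) for u, v in zip(p, q))

def apply_kernel(A, v):
    """Column-stochastic update: (A v)[y] = sum_x A[y][x] * v[x]."""
    return [sum(A[y][x] * v[x] for x in range(len(v))) for y in range(len(A))]

def expected_empirical(A, p1, n):
    """E pi_hat^(n) = (1/n) * sum_{i=1}^n L(X_i), where L(X_i) = A^{i-1} p1."""
    dist, acc = list(p1), [0.0] * len(p1)
    for _ in range(n):
        acc = [s + d for s, d in zip(acc, dist)]
        dist = apply_kernel(A, dist)
    return [s / n for s in acc]

# Hypothetical two-state kernel (columns sum to 1) with pi = (0.25, 0.75).
A = [[0.7, 0.1],
     [0.3, 0.9]]
pi = [0.25, 0.75]
G, theta, n = 1.0, 0.6, 50
gap = tv(expected_empirical(A, [1.0, 0.0], n), pi)   # start at state 0
bound = G / ((1.0 - theta) * n)
print(gap, bound)
```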
2.2 Markov contraction and decoupling

Let us recast the contraction coefficient defined in (3) in the language of Markov kernels:

$$\kappa = \sup_{x, x' \in \mathbb{N}} \left\| A(\cdot \mid x) - A(\cdot \mid x') \right\|_{TV}.$$

The term "contraction" is justified by the following simple fact [5, 19]:

Lemma 4 (Markov, 1906 [23]). For any two stochastic vectors
$\phi, \psi \in \mathbb{R}^{\mathbb{N}}$, we have

$$\|A\phi - A\psi\|_{TV} \le \kappa \|\phi - \psi\|_{TV}.$$

Our principal application of this result will be in the context of geometrically ergodic
Markov kernels.

Corollary 5. Let $A$ be a $(G, \theta)$-ergodic Markov kernel. Then for all
$n \in \mathbb{N}$, the $n$-step kernel $A^n$ has contraction coefficient
$\kappa \le 2G\theta^n$.

Proof. Let $\pi$ be the stationary distribution of $A$ and
$\phi, \psi \in \mathbb{R}^{\mathbb{N}}$ two point masses. Then

$$\|A^n\phi - A^n\psi\|_{TV} \le \|A^n\phi - \pi\|_{TV} + \|A^n\psi - \pi\|_{TV} \le 2\tau_{n+1} \le 2G\theta^n.$$

□
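Corollary 5 can be checked numerically on a finite kernel: the contraction coefficient of $A^n$ is the largest total variation distance between two columns of $A^n$. The sketch below does this for a hypothetical two-state column-stochastic kernel, for which $G = 1$ and $\theta = 0.6$ are valid constants; the kernel and the helper names are assumptions for illustration.

```python
def tv(p, q):
    """Total variation distance: half the l1 distance."""
    return 0.5 * sum(abs(u - v) for u, v in zip(p, q))

def matmul(P, Q):
    """Product of two square matrices given as lists of rows."""
    m = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]

def contraction(A):
    """kappa(A): maximum TV distance between two columns of A."""
    cols = [[row[x] for row in A] for x in range(len(A))]
    return max(tv(c, d) for c in cols for d in cols)

# Hypothetical column-stochastic kernel: A[y][x] = A(y | x).
A = [[0.7, 0.1],
     [0.3, 0.9]]
G, theta = 1.0, 0.6
power, kappas = A, []
for n in range(1, 9):
    kappas.append(contraction(power))   # kappa(A^n)
    power = matmul(A, power)            # advance to A^(n+1)
for n, k in enumerate(kappas, start=1):
    print(n, k, 2 * G * theta ** n)
```

In this example $\kappa(A^n) = 0.6^n$, a factor of 2 below the bound of the corollary.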

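Our next result expands upon the observation in (9) that to a large degree, the statistical
behavior of an HMM is controlled by its underlying Markov chain.

Lemma 6. Let $X$ and $X'$ be two Markov chains induced by $(\xi, A)$ and $(\xi', A')$,
respectively. For a given emission matrix $B$, let $Y$ and $Y'$ be the hidden Markov chains
induced by $(\xi, A, B)$ and $(\xi', A', B)$. Then

$$\left\| \mathcal{L}(Y_{i \in I}) - \mathcal{L}(Y'_{i \in I}) \right\|_{TV} \le \left\| \mathcal{L}(X_{i \in I}) - \mathcal{L}(X'_{i \in I}) \right\|_{TV}, \qquad I \subseteq \{1, \ldots, n\},\ n \in \mathbb{N}.$$

Proof. The following convention will be used for discontiguous indices. For
$I = \{i_1, i_2, \ldots, i_{|I|}\} \subset \mathbb{N}$ and $u \in \mathbb{N}^I$, the
coordinates of $u$ are indexed as

$$u = (u_i)_{i \in I} = (u_{i_1}, u_{i_2}, \ldots, u_{i_{|I|}})$$

and not as $u = (u_i)_{1 \le i \le |I|} = (u_1, u_2, \ldots, u_{|I|})$.

Let $Q$ and $Q'$ be the probability measures induced on $\mathbb{N}^n$ by $X$ and $X'$,
respectively, and let $P$ and $P'$ be the probability measures induced by $Y$ and $Y'$. Put
$J = \{1, \ldots, n\} \setminus I$. For $w \in \mathbb{N}^I$ and $z \in \mathbb{N}^J$, define
$x[w, z] \in \mathbb{N}^{I \cup J}$ to be such that $x_i = w_i$ for $i \in I$ and $x_j = z_j$
for $j \in J$. For $u \in \mathbb{N}^I$, $v \in \mathbb{N}^J$, we define
$y[u, v] \in \mathbb{N}^{I \cup J}$ analogously. Then

$$\begin{aligned}
\left\| \mathcal{L}(Y_{i \in I}) - \mathcal{L}(Y'_{i \in I}) \right\|_{TV}
&= \frac{1}{2} \sum_{u \in \mathbb{N}^I} \left| \sum_{v \in \mathbb{N}^J} \sum_{w \in \mathbb{N}^I} \sum_{z \in \mathbb{N}^J} \prod_{i \in I} B(u_i \mid w_i) \prod_{j \in J} B(v_j \mid z_j) \left( Q(x[w, z]) - Q'(x[w, z]) \right) \right| \\
&= \frac{1}{2} \sum_{u \in \mathbb{N}^I} \left| \sum_{w \in \mathbb{N}^I} \prod_{i \in I} B(u_i \mid w_i) \sum_{z \in \mathbb{N}^J} \left( Q(x[w, z]) - Q'(x[w, z]) \right) \right| \\
&\le \frac{1}{2} \sum_{u \in \mathbb{N}^I} \sum_{w \in \mathbb{N}^I} \prod_{i \in I} B(u_i \mid w_i) \left| \sum_{z \in \mathbb{N}^J} \left( Q(x[w, z]) - Q'(x[w, z]) \right) \right| \\
&= \frac{1}{2} \sum_{w \in \mathbb{N}^I} \left| \sum_{z \in \mathbb{N}^J} \left( Q(x[w, z]) - Q'(x[w, z]) \right) \right| \\
&= \left\| \mathcal{L}(X_{i \in I}) - \mathcal{L}(X'_{i \in I}) \right\|_{TV},
\end{aligned}$$

where the second and fourth lines use the fact that the columns of $B$ sum to 1.

□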



2.3 Proof of main inequality 

In this section, we prove Theorem 1. The first order of business is to bound the
$\eta$-mixing coefficient by the inverse mixing time, and hence in terms of $G$ and $\theta$.

Lemma 7. Let $Y$ be a $(G, \theta)$-geometrically ergodic hidden Markov chain and let
$\bar\eta_{ij}$ and $\tau_s$ be as defined in (1) and (4), respectively. Then

$$\bar\eta_{ij} \le 2\tau_{j-i+1} \le 2G\theta^{j-i}, \qquad n \in \mathbb{N},\ 1 \le i < j \le n.$$

Proof. Let $X$ be the Markov chain underlying $Y$ and endow $\bar\eta_{ij}(X)$,
$\bar\eta_{ij}(Y)$ with the obvious meaning. Then [19, Theorem 7.1] shows that

$$\bar\eta_{ij}(Y) \le \bar\eta_{ij}(X).$$

Next, Remark 4 and the Theorem preceding it in [18] show that

$$\bar\eta_{ij}(X) \le \kappa(A^{j-i}),$$

where $\kappa(A^{j-i})$ is the contraction coefficient of the $(j - i)$-step Markov kernel of
$X$. Finally, Corollary 5 yields

$$\kappa(A^{j-i}) \le 2\tau_{j-i+1} \le 2G\theta^{j-i}.$$

□
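Proof of Theorem 1. By (2), it suffices to upper-bound

$$\|\Delta\|_\infty = 1 + \max_{1 \le i < n} \sum_{j=i+1}^{n} \bar\eta_{ij}.$$

Applying Lemma 7, we get

$$\max_{1 \le i < n} \sum_{j=i+1}^{n} \bar\eta_{ij} \le 2G \max_{1 \le i < n} \sum_{j=i+1}^{n} \theta^{j-i} \le 2G \sum_{k=1}^{\infty} \theta^k.$$

Since $G \ge 1$ by assumption, we have

$$\|\Delta\|_\infty \le 1 + 2G \sum_{k=1}^{\infty} \theta^k \le 2G \sum_{k=0}^{\infty} \theta^k = \frac{2G}{1-\theta}.$$

□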
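For completeness, the final substitution can be spelled out as follows. This is a sketch that assumes (2) is read with $\min\{\|\Delta\|_2^2, \|\Delta\|_\infty^2\}$ in the denominator of the exponent, so that the minimum may be replaced by $\|\Delta\|_\infty^2 \le 4G^2/(1-\theta)^2$:

$$\mathbf{P}(|f - \mathbf{E}f| > n\varepsilon) \le 2\exp\left(-\frac{2n\varepsilon^2}{\|\Delta\|_\infty^2}\right) \le 2\exp\left(-\frac{2n\varepsilon^2 (1-\theta)^2}{4G^2}\right) = 2\exp\left(-\frac{n(1-\theta)^2\varepsilon^2}{2G^2}\right).$$

The exponent agrees with the one used when Theorem 1 is applied in Sections 2.4 and 2.5; the factor 2 accounts for the two tails.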

2.4 Proof of the DKW-type inequality 

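In this section, we prove Theorem 2. Let $Y_1, Y_2, \ldots$ be a stationary
$(G, \theta)$-geometrically ergodic hidden Markov chain with stationary distribution $\rho$
and define the $\{0, 1\}$-indicator variables

$$\xi_i^{(y)} = \mathbf{1}_{\{Y_i = y\}}, \qquad i, y \in \mathbb{N}. \qquad (10)$$

Then $\hat\rho$, defined in (6), is given by
$\hat\rho_y = \frac{1}{n} \sum_{i=1}^{n} \xi_i^{(y)}$, where we have dropped the superscript
$(n)$ from $\hat\rho$ for readability. Observing that the map
$(Y_1, \ldots, Y_n) \mapsto n\|\rho - \hat\rho\|_\infty$ is 1-Lipschitz under the Hamming
metric, we apply Theorem 1:

$$\mathbf{P}\left(\|\rho - \hat\rho\|_\infty > \mathbf{E}\|\rho - \hat\rho\|_\infty + \varepsilon\right) \le \exp\left(-\frac{n(1-\theta)^2\varepsilon^2}{2G^2}\right).$$

Hence, it remains to bound $\mathbf{E}\|\rho - \hat\rho\|_\infty$.

Lemma 8.

$$\mathbf{E}\|\rho - \hat\rho\|_\infty \le \sqrt{\frac{1 + 4G\theta}{n(1-\theta)}}.$$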



Remark. This estimate is nearly tight in the case where the $Y_i$ are iid (i.e.,
$\theta = 0$) Bernoulli variables with parameter $p$; see [4, Lemma 6].
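Proof. Jensen's inequality yields

$$\left(\mathbf{E}\|\rho - \hat\rho\|_\infty\right)^2 \le \mathbf{E}\|\rho - \hat\rho\|_\infty^2 \le \mathbf{E} \sum_{y \in \mathbb{N}} |\rho_y - \hat\rho_y|^2 = \sum_{y \in \mathbb{N}} \mathbf{E}(\hat\rho_y - \rho_y)^2 = \sum_{y \in \mathbb{N}} \mathrm{Var}[\hat\rho_y]. \qquad (11)$$

Putting $s_n^{(y)} = \sum_{i=1}^{n} \xi_i^{(y)}$, we have

$$\mathrm{Var}[\hat\rho_y] = \frac{1}{n^2} \left( \mathbf{E}\left(s_n^{(y)}\right)^2 - \left(\mathbf{E} s_n^{(y)}\right)^2 \right) \qquad (12)$$

and

$$\mathbf{E} s_n^{(y)} = n\rho_y. \qquad (13)$$

To bound $\mathbf{E}\left(s_n^{(y)}\right)^2$, we compute

$$\mathbf{E}\left(s_n^{(y)}\right)^2 = \mathbf{E} \sum_{1 \le i, j \le n} \xi_i^{(y)} \xi_j^{(y)} = \mathbf{E} \sum_{i=1}^{n} \left(\xi_i^{(y)}\right)^2 + 2 \sum_{1 \le i < j \le n} \mathbf{E}\left[\xi_i^{(y)} \xi_j^{(y)}\right] = n\rho_y + 2 \sum_{1 \le i < j \le n} \mathbf{E}\left[\xi_i^{(y)} \xi_j^{(y)}\right], \qquad (14)$$

where the last identity holds since $\xi_i^{(y)} \in \{0, 1\}$. It now remains to estimate
$\mathbf{E}\left[\xi_i^{(y)} \xi_j^{(y)}\right]$. To this end, we claim that

$$\left\| \mathcal{L}(Y_j \mid Y_1 = y) - \rho \right\|_\infty \le 2G\theta^{j-1}, \qquad y \in \mathbb{N},\ j \ge 1.$$

Indeed, denoting the parameters of $Y$ by $(\pi, A, B)$ and letting $X$ be the underlying
Markov chain, we have

$$\begin{aligned}
\left\| \mathcal{L}(Y_i \mid Y_1 = y_1) - \rho \right\|_\infty
&\le 2 \left\| \mathcal{L}(Y_i \mid Y_1 = y_1) - \rho \right\|_{TV}
 = \sum_{y_i \in \mathbb{N}} \left| \mathbf{P}(Y_i = y_i \mid Y_1 = y_1) - \rho_{y_i} \right| \\
&\le \sum_{x_i \in \mathbb{N}} \left| \mathbf{P}(X_i = x_i \mid Y_1 = y_1) - \pi_{x_i} \right| \\
&= 2 \left\| \sum_{x_1 \in \mathbb{N}} \mathcal{L}(X_i \mid X_1 = x_1)\, \mathbf{P}(X_1 = x_1 \mid Y_1 = y_1) - \pi \right\|_{TV} \\
&\le 2 \sup_{x_1 \in \mathbb{N}} \left\| \mathcal{L}(X_i \mid X_1 = x_1) - \pi \right\|_{TV} \le 2G\theta^{i-1}.
\end{aligned}$$

Hence, by stationarity,

$$\begin{aligned}
\mathbf{E}\left[\xi_i^{(y)} \xi_j^{(y)}\right]
&= \mathbf{P}(Y_1 = y,\ Y_{j-i+1} = y) \\
&= \mathbf{P}(Y_1 = y)\, \mathbf{P}(Y_{j-i+1} = y \mid Y_1 = y) \\
&\le \rho_y \left( \rho_y + 2G\theta^{j-i} \right),
\end{aligned}$$

and therefore

$$\begin{aligned}
\sum_{1 \le i < j \le n} \mathbf{E}\left[\xi_i^{(y)} \xi_j^{(y)}\right]
&= \sum_{k=1}^{n-1} (n - k)\, \mathbf{P}(Y_1 = y)\, \mathbf{P}(Y_{k+1} = y \mid Y_1 = y) \\
&\le \sum_{k=1}^{n-1} (n - k)\, \rho_y \left( \rho_y + 2G\theta^k \right) \\
&= \frac{n(n-1)}{2} \rho_y^2 + 2 \frac{G\theta}{1-\theta} \left( n - \frac{1 - \theta^n}{1-\theta} \right) \rho_y \\
&\le \frac{n(n-1)}{2} \rho_y^2 + 2n \frac{G\theta}{1-\theta} \rho_y. \qquad (15)
\end{aligned}$$

Combining (12), (13), (14), and (15), we have

$$\begin{aligned}
\mathrm{Var}[\hat\rho_y]
&\le \frac{1}{n^2} \left( n\rho_y + n(n-1)\rho_y^2 + 4n \frac{G\theta}{1-\theta} \rho_y - n^2\rho_y^2 \right) \\
&\le \frac{\rho_y}{n} \left( 1 + \frac{4G\theta}{1-\theta} \right)
 \le \rho_y \frac{1 + 4G\theta}{n(1-\theta)}.
\end{aligned}$$

Since $\sum_{y \in \mathbb{N}} \rho_y = 1$, the claim follows from (11). □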
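The variance computation above relies on summing the weighted geometric series $\sum_{k=1}^{n-1} (n-k)\theta^k$. As a sanity check, the sketch below compares the direct sum with the closed form $\frac{\theta}{1-\theta}\left(n - \frac{1-\theta^n}{1-\theta}\right)$ and with the cruder bound $n\theta/(1-\theta)$. The function names and the grid of test values are choices made for this illustration.

```python
def weighted_geometric_sum(n, theta):
    """Direct evaluation of sum_{k=1}^{n-1} (n - k) * theta^k."""
    return sum((n - k) * theta ** k for k in range(1, n))

def closed_form(n, theta):
    """theta / (1 - theta) * (n - (1 - theta^n) / (1 - theta))."""
    return theta / (1.0 - theta) * (n - (1.0 - theta ** n) / (1.0 - theta))

# Grid of (n, theta) pairs, including the iid case theta = 0.
checks = [(n, t) for n in (2, 3, 10, 200) for t in (0.0, 0.1, 0.6, 0.95)]
max_gap = max(abs(weighted_geometric_sum(n, t) - closed_form(n, t))
              for n, t in checks)
dominated = all(closed_form(n, t) <= n * t / (1.0 - t) + 1e-9
                for n, t in checks)
print(max_gap, dominated)
```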
Remark. Note that in the process of proving a deviation estimate on
$\|\rho - \hat\rho\|_\infty$, we have actually proven a stronger one, namely, for the
$\ell_2$ norm.

2.5 Proof of the uniform Chernoff bound 

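In this section, we prove Theorem 3. As before, $Y_1, Y_2, \ldots$ is a stationary
$(G, \theta)$-geometrically ergodic hidden Markov chain with stationary distribution $\rho$.
As noted in [4, Lemma 7], the map $(Y_1, \ldots, Y_n) \mapsto n\|\rho - \hat\rho\|_{TV}$ is
1-Lipschitz under the Hamming metric, and so Theorem 1 applies:

$$\mathbf{P}\left(\|\rho - \hat\rho\|_{TV} > \mathbf{E}\|\rho - \hat\rho\|_{TV} + \varepsilon\right) \le \exp\left(-\frac{n(1-\theta)^2\varepsilon^2}{2G^2}\right). \qquad (16)$$

As before, the crux of the matter is to bound $\mathbf{E}\|\rho - \hat\rho\|_{TV}$. Recall the
definition of $\Lambda_n$ from the statement of Theorem 3.

Lemma 9.

$$\mathbf{E}\|\rho - \hat\rho\|_{TV} \le \Lambda_n(\rho).$$

Proof. We proceed by breaking up the expectation into two terms

$$\mathbf{E}\|\rho - \hat\rho\|_{TV} = \frac{1}{2} \sum_{y:\ \rho_y < 1/n} \mathbf{E}|\rho_y - \hat\rho_y| + \frac{1}{2} \sum_{y:\ \rho_y \ge 1/n} \mathbf{E}|\rho_y - \hat\rho_y| \qquad (17)$$

and bounding each term separately. To bound the second term, we note, as in the proof of
Lemma 8, that

$$\mathbf{E}|\hat\rho_y - \rho_y| \le \sqrt{\mathrm{Var}[\hat\rho_y]} \le \sqrt{\rho_y \frac{1 + 4G\theta}{n(1-\theta)}}, \qquad y \in \mathbb{N}. \qquad (18)$$

To bound the first term, we recall the indicator variables $\xi_i^{(y)}$ defined in (10) and
observe that

$$\begin{aligned}
n\,\mathbf{E}|\rho_y - \hat\rho_y|
&= \mathbf{E}\left| \sum_{i=1}^{n} \left( \xi_i^{(y)} - \rho_y \right) \right|
 \le \sum_{i=1}^{n} \mathbf{E}\left| \xi_i^{(y)} - \rho_y \right| \\
&= 2n\rho_y(1 - \rho_y) \le 2n\rho_y,
\end{aligned}$$

where stationarity was used in the last line of the derivation.

Combining the last display with (17) and (18) yields the claim. □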

Proof of Theorem 3. (a) Since obviously

$$\sum_{\rho_y < 1/n} \rho_y \xrightarrow[n \to \infty]{} 0,$$

it suffices to show that

$$\frac{1}{\sqrt{n}} \sum_{\rho_y \ge 1/n} \sqrt{\rho_y} \xrightarrow[n \to \infty]{} 0,$$

which is proved in [4, Lemma 10].

(b) The claim follows from (16), Lemma 9, and the fact that for any two distributions
$\phi, \psi \in \mathbb{R}^{\mathbb{N}}$,

$$\|\phi - \psi\|_{TV} = \sup_{E \subseteq \mathbb{N}} |\phi(E) - \psi(E)|.$$

□

2.6 The stationarity assumption 

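For rapidly mixing Markov and hidden Markov chains, the stationarity assumption can easily
be relaxed. Indeed, let $Y = (Y_1, \ldots, Y_n)$ be a $(G, \theta)$-geometrically ergodic
hidden Markov chain with parameters $(\pi', A, B)$, where
$\pi' \in \mathbb{R}^{\mathbb{N}}$ is some stochastic vector. If $Y$ is "nearly stationary,"
in the sense that $\|\pi - \pi'\|_{TV}$ is small, a simple dimension-free bound on the
statistical distance between $Y$ and its stationary version is available.

Theorem 10. Let $Y' = (Y'_1, \ldots, Y'_n)$ be the stationary version of $Y$, i.e., an HMM
with parameters $(\pi, A, B)$, where $\pi$ is the stationary distribution of the kernel $A$.
Then

$$\left\| \mathcal{L}(Y) - \mathcal{L}(Y') \right\|_{TV} \le \|\pi - \pi'\|_{TV}.$$

First, we prove an analogous result for Markov chains.

Lemma 11. Let $A$ be a Markov kernel and $\xi, \xi' \in \mathbb{R}^{\mathbb{N}}$ two
arbitrary stochastic vectors. Let $X = (X_1, \ldots, X_n)$ and $X' = (X'_1, \ldots, X'_n)$ be
the Markov chains induced by $(\xi, A)$ and $(\xi', A)$, respectively. Then

$$\left\| \mathcal{L}(X) - \mathcal{L}(X') \right\|_{TV} = \|\xi - \xi'\|_{TV}.$$

Proof.

$$\begin{aligned}
\left\| \mathcal{L}(X) - \mathcal{L}(X') \right\|_{TV}
&= \frac{1}{2} \sum_{x \in \mathbb{N}^n} \left| \xi_{x_1} A_{x_2, x_1} \cdots A_{x_n, x_{n-1}} - \xi'_{x_1} A_{x_2, x_1} \cdots A_{x_n, x_{n-1}} \right| \\
&= \frac{1}{2} \sum_{x \in \mathbb{N}^n} A_{x_2, x_1} \cdots A_{x_n, x_{n-1}} \left| \xi_{x_1} - \xi'_{x_1} \right| \\
&= \frac{1}{2} \sum_{x_1 \in \mathbb{N}} \left| \xi_{x_1} - \xi'_{x_1} \right| = \|\xi - \xi'\|_{TV}.
\end{aligned}$$

□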
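Lemma 11 can be verified by brute force on a small example: enumerate all paths of length $n$, compute the two path laws, and compare their total variation distance with that of the initial distributions. The kernel, the two initial vectors, and the helper names below are hypothetical; the kernel is column-stochastic, so `A[nxt][prev]` is the probability of moving from `prev` to `nxt`.

```python
from itertools import product

def chain_law(n, xi, A):
    """Law of (X_1, ..., X_n): xi[x_1] * prod_i A[x_{i+1}][x_i]."""
    law = {}
    for x in product(range(len(xi)), repeat=n):
        prob = xi[x[0]]
        for prev, nxt in zip(x, x[1:]):
            prob *= A[nxt][prev]
        law[x] = prob
    return law

def tv_dict(P, Q):
    """Total variation distance between two laws on the same finite set."""
    return 0.5 * sum(abs(P[k] - Q[k]) for k in P)

# Hypothetical column-stochastic kernel and two initial distributions.
A = [[0.7, 0.1],
     [0.3, 0.9]]
xi, xi_prime = [0.25, 0.75], [0.9, 0.1]
lhs = tv_dict(chain_law(4, xi, A), chain_law(4, xi_prime, A))
rhs = 0.5 * sum(abs(u - v) for u, v in zip(xi, xi_prime))
print(lhs, rhs)
```

The equality holds for every path length because the shared transition factors sum to 1 once the first coordinate is fixed.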



Proof of Theorem 10. Lemma 6 lets us restrict our attention to the underlying Markov chains
$X$ and $X'$, respectively:

$$\begin{aligned}
\left\| \mathcal{L}(Y_{1 \le i \le n}) - \mathcal{L}(Y'_{1 \le i \le n}) \right\|_{TV}
&\le \left\| \mathcal{L}(X_{1 \le i \le n}) - \mathcal{L}(X'_{1 \le i \le n}) \right\|_{TV} \\
&= \left\| \mathcal{L}(X_1) - \mathcal{L}(X'_1) \right\|_{TV} = \|\pi - \pi'\|_{TV},
\end{aligned}$$

where the first identity follows from Lemma 11. □

Corollary 12. Let $Y_1, Y_2, \ldots$ be a (not necessarily stationary) $\mathbb{N}$-valued
$(G, \theta)$-geometrically ergodic hidden Markov chain with stationary distribution
$\rho = B\pi$ and initial distribution $\rho' = B\pi'$. Then the deviation bounds stated in
Theorems 2 and 3 hold with an additive correction of $\|\pi - \pi'\|_{TV}$ on the right-hand
side.

Letting a $(G, \theta)$-geometrically ergodic chain run for $s$ steps before starting the
estimation ensures that $\|\pi - \pi'\|_{TV} \le G\theta^s$.
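The burn-in remark translates into a simple rule of thumb: to make the additive correction $G\theta^s$ at most a target $\delta$, it suffices to take $s \ge \log(G/\delta)/\log(1/\theta)$. The sketch below computes the smallest such integer. The function name and the example constants are assumptions, and the case $\theta = 0$ is excluded to keep the logarithm well defined.

```python
from math import ceil, log

def burn_in_steps(G, theta, delta):
    """Smallest integer s >= 0 with G * theta^s <= delta.

    Requires G >= 1, 0 < theta < 1 and delta > 0.
    """
    if G < 1.0 or not (0.0 < theta < 1.0) or delta <= 0.0:
        raise ValueError("need G >= 1, 0 < theta < 1 and delta > 0")
    return max(0, ceil(log(G / delta) / log(1.0 / theta)))

# Hypothetical constants, for illustration only.
G, theta, delta = 1.0, 0.6, 0.01
s = burn_in_steps(G, theta, delta)
print(s, G * theta ** s)
```

Because the correction decays geometrically, the burn-in length grows only logarithmically in $1/\delta$.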